Evolutionary Instance Resampling for Difficult Data Sets
Abstract
In the field of machine learning, data set features such as across-class imbalance and class overlap often pose difficulties for classifier algorithms. A number of methods alleviate these difficulties by adjusting the distribution of the data set before classifier construction. Resampling is typically effected by re-weighting, removing, or duplicating instances. Finding a good distribution for the data set, however, is a nontrivial problem. Evolutionary algorithms are frequently used to search for solutions in large, difficult search spaces. In this thesis, four evolutionary approaches are applied to the problem of instance resampling across a variety of data sets and classifier paradigms. In many cases, the evolutionary pre-processing methods are able to produce better classifiers. In particular, an integer-based, one-to-one representation and a cluster-based, real-valued weighting scheme are shown to be beneficial for improving classifier performance on difficult data sets.
Index words: genetic algorithms, machine learning, imbalance, undersampling, oversampling, instance selection
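To make the idea of an integer-based, one-to-one representation concrete, here is a minimal sketch in which each gene is the number of copies of one training instance (0 removes it, 1 keeps it, 2 or more duplicates it). This is not the thesis's implementation: the weighted k-NN classifier, the balanced-accuracy fitness, the genetic operators, the toy data, and every function name and parameter value below are assumptions chosen only for illustration.

```python
import random

def balanced_accuracy(predict, data):
    """Mean per-class recall, so the minority class counts as much as the majority."""
    hits, totals = {}, {}
    for x, y in data:
        totals[y] = totals.get(y, 0) + 1
        hits[y] = hits.get(y, 0) + (predict(x) == y)
    return sum(hits[y] / totals[y] for y in totals) / len(totals)

def make_knn(train, counts, k=3):
    """Weighted k-NN: an instance with count c casts c votes; count 0 removes it."""
    kept = [(x, y, c) for (x, y), c in zip(train, counts) if c > 0]
    def predict(x):
        near = sorted(kept, key=lambda t: sum((a - b) ** 2 for a, b in zip(x, t[0])))[:k]
        votes = {}
        for _, y, c in near:
            votes[y] = votes.get(y, 0) + c
        return max(votes, key=votes.get) if votes else None
    return predict

def fitness(counts, train, valid):
    """Score one chromosome: build the resampled classifier, evaluate on validation data."""
    return balanced_accuracy(make_knn(train, counts), valid)

def evolve(train, valid, max_count=3, pop_size=20, generations=30, seed=0):
    """Search for per-instance copy counts (one gene per training instance)."""
    rng = random.Random(seed)
    n = len(train)
    # Seed the population with the unmodified data set (all counts 1), so that
    # elitism guarantees the result is never worse than the baseline.
    pop = [[1] * n] + [[rng.randint(0, max_count) for _ in range(n)]
                       for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=lambda ind: fitness(ind, train, valid), reverse=True)
        parents = pop[: pop_size // 2]  # truncation selection; parents survive
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            child[rng.randrange(n)] = rng.randint(0, max_count)  # point mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda ind: fitness(ind, train, valid))

def toy_data(n_major, n_minor, seed):
    """Hypothetical imbalanced, overlapping two-class data in two dimensions."""
    rng = random.Random(seed)
    major = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(n_major)]
    minor = [((rng.gauss(1.5, 1), rng.gauss(1.5, 1)), 1) for _ in range(n_minor)]
    return major + minor

train = toy_data(40, 6, seed=1)
valid = toy_data(40, 6, seed=2)
baseline = fitness([1] * len(train), train, valid)
best = evolve(train, valid)
print("baseline:", baseline, "evolved:", fitness(best, train, valid))
```

Because the fitness is measured on the same validation data that guides the search, the evolved score is optimistic; a separate held-out test set would be needed for an honest comparison against the unresampled baseline.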
Similar resources
Evolutionary rule-based systems for imbalanced data sets
This paper investigates the capabilities of evolutionary online rule-based systems, also called Learning Classifier Systems (LCSs), for extracting knowledge from imbalanced data. While some learners may suffer from class imbalances and instances sparsely distributed around the feature space, we show that LCSs are flexible methods that can be adapted to detect such cases and find suitable models...
An Improved Algorithm for SVMs Classification of Imbalanced Data Sets
Support Vector Machines (SVMs) have strong theoretical foundations and excellent empirical success in many pattern recognition and data mining applications. However, when induced by imbalanced training sets, where the examples of the target class (minority) are outnumbered by the examples of the non-target class (majority), the performance of SVM classifier is not so successful. In medical diag...
Credit Card Fraud Detection using Data mining and Statistical Methods
Due to today’s advancement in technology and businesses, fraud detection has become a critical component of financial transactions. Considering vast amounts of data in large datasets, it becomes more difficult to detect fraud transactions manually. In this research, we propose a combined method using both data mining and statistical tasks, utilizing feature selection, resampling and cost-...
A Study on the Combination of Evolutionary Algorithms and Stratified Strategies for Training Set Selection in Data Mining
Evolutionary algorithms are adaptive methods based on natural evolution that may be used for search and optimization. As Training Set Selection can be viewed as a search problem, it could be solved using evolutionary algorithms. In this paper, we have carried out an empirical study of the performance of CHC as representative evolutionary algorithm model. This study includes a comparison between...
Time-stamped resampling for robust evolutionary portfolio optimization
Traditional mean-variance financial portfolio optimization is based on two sets of parameters, estimates for the asset returns and the variance-covariance matrix. The allocations resulting from both traditional methods and heuristics are very dependent on these values. Given the unreliability of these forecasts, the expected risk and return for the portfolios in the efficient frontier often dif...